On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.
Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.
Every year the Nobel Prize is awarded to scientists and scholars in the categories of chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
# %pip install --upgrade plotly
# !pip3 install --upgrade seaborn
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:,.2f}'.format
df_data = pd.read_csv('nobel_prize_data.csv')
Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.
Challenge: Preliminary data exploration.
What is the shape of df_data? How many rows and columns?
df_data.shape
(962, 16)
df_data.columns
Index(['year', 'category', 'prize', 'motivation', 'prize_share',
'laureate_type', 'full_name', 'birth_date', 'birth_city',
'birth_country', 'birth_country_current', 'sex', 'organization_name',
'organization_city', 'organization_country', 'ISO'],
dtype='object')
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE |
df_data.year.min()
1901
df_data.year.max()
2020
Challenge: Are there any duplicate rows in the dataset? Are there any NaN values?
df_data.duplicated().values.any()
False
df_data.isna().sum()
year                       0
category                   0
prize                      0
motivation                88
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_city                31
birth_country             28
birth_country_current     28
sex                       28
organization_name        255
organization_city        255
organization_country     254
ISO                       28
dtype: int64
col_subset = ['year','category', 'laureate_type',
'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset].head()
| | year | category | laureate_type | birth_date | full_name | organization_name |
|---|---|---|---|---|---|---|
| 24 | 1904 | Peace | Organization | NaN | Institut de droit international (Institute of ... | NaN |
| 60 | 1910 | Peace | Organization | NaN | Bureau international permanent de la Paix (Per... | NaN |
| 89 | 1917 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
| 200 | 1938 | Peace | Organization | NaN | Office international Nansen pour les Réfugiés ... | NaN |
| 215 | 1944 | Peace | Organization | NaN | Comité international de la Croix Rouge (Intern... | NaN |
df_data.loc[df_data.organization_name.isna()][col_subset].head()
| | year | category | laureate_type | birth_date | full_name | organization_name |
|---|---|---|---|---|---|---|
| 1 | 1901 | Literature | Individual | 1839-03-16 | Sully Prudhomme | NaN |
| 3 | 1901 | Peace | Individual | 1822-05-20 | Frédéric Passy | NaN |
| 4 | 1901 | Peace | Individual | 1828-05-08 | Jean Henry Dunant | NaN |
| 7 | 1902 | Literature | Individual | 1817-11-30 | Christian Matthias Theodor Mommsen | NaN |
| 9 | 1902 | Peace | Individual | 1843-05-21 | Charles Albert Gobat | NaN |
Challenge: Convert the birth_date column to Pandas Datetime objects. Then add a column called share_pct which has the laureates' share as a percentage in the form of a floating-point number.
df_data.birth_date = pd.to_datetime(df_data.birth_date)
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year                   962 non-null    int64
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: datetime64[ns](1), int64(1), object(14)
memory usage: 120.4+ KB
separated_values = df_data.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df_data['share_pct'] = numerator / denominator
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 |
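Note that share_pct in the table holds a fraction between 0 and 1 (0.50 for a half share) rather than a value out of 100. As a cross-check on the split-and-divide approach, here is a small sketch that parses the same strings with Python's standard `fractions` module. The helper name `share_to_float` is hypothetical and not part of the notebook.

```python
from fractions import Fraction


def share_to_float(share: str) -> float:
    """Convert a prize share such as '1/4' into a float such as 0.25."""
    return float(Fraction(share))


shares = ["1/1", "1/2", "1/4"]
as_floats = [share_to_float(s) for s in shares]
print(as_floats)  # [1.0, 0.5, 0.25]
```

Applied to the DataFrame, this would look like `df_data.prize_share.map(share_to_float)`. Multiply by 100 if you want a true percentage.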
Challenge: Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?
biology = df_data.sex.value_counts()
biology
Male      876
Female     58
Name: sex, dtype: int64
fig = px.pie(names=biology.index,
             values=biology.values,
             title="Percentage of Male vs. Female Winners",
             hole=0.5)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='label+percent')
fig.show()
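The chart shows the percentages on hover, but the number is easy to work out by hand as well. This sketch simply reuses the two counts printed by value_counts() above. Keep in mind that the 28 rows without a recorded sex (the organisations) are not part of the denominator.

```python
# Counts taken from the value_counts() output above.
male, female = 876, 58

# Share of prizes won by women among laureates with a recorded sex.
pct_female = female / (male + female) * 100
print(f"{pct_female:.2f}%")  # 6.21%
```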
Challenge: Who were the first three women to win a Nobel Prize? What do you notice about their birth_country? Were they part of an organisation?
df_data.loc[df_data.sex == "Female"][:3]
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 1903 | Physics | The Nobel Prize in Physics 1903 | "in recognition of the extraordinary services ... | 1/4 | Individual | Marie Curie, née Sklodowska | 1867-11-07 | Warsaw | Russian Empire (Poland) | Poland | Female | NaN | NaN | NaN | POL | 0.25 |
| 29 | 1905 | Peace | The Nobel Peace Prize 1905 | NaN | 1/1 | Individual | Baroness Bertha Sophie Felicita von Suttner, n... | 1843-06-09 | Prague | Austrian Empire (Czech Republic) | Czech Republic | Female | NaN | NaN | NaN | CZE | 1.00 |
| 51 | 1909 | Literature | The Nobel Prize in Literature 1909 | "in appreciation of the lofty idealism, vivid ... | 1/1 | Individual | Selma Ottilia Lovisa Lagerlöf | 1858-11-20 | Mårbacka | Sweden | Sweden | Female | NaN | NaN | NaN | SWE | 1.00 |
Challenge: Did some people get a Nobel Prize more than once? If so, who were they?
is_winner = df_data.duplicated(subset=["full_name"], keep=False)
multiple_winners = df_data[is_winner]
multiple_winners.full_name.nunique()
6
col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]
| | year | category | laureate_type | full_name |
|---|---|---|---|---|
| 18 | 1903 | Physics | Individual | Marie Curie, née Sklodowska |
| 62 | 1911 | Chemistry | Individual | Marie Curie, née Sklodowska |
| 89 | 1917 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 215 | 1944 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 278 | 1954 | Chemistry | Individual | Linus Carl Pauling |
| 283 | 1954 | Peace | Organization | Office of the United Nations High Commissioner... |
| 297 | 1956 | Physics | Individual | John Bardeen |
| 306 | 1958 | Chemistry | Individual | Frederick Sanger |
| 340 | 1962 | Peace | Individual | Linus Carl Pauling |
| 348 | 1963 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 424 | 1972 | Physics | Individual | John Bardeen |
| 505 | 1980 | Chemistry | Individual | Frederick Sanger |
| 523 | 1981 | Peace | Organization | Office of the United Nations High Commissioner... |
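The keep=False argument is what makes every row of a repeat winner show up in the table, not just the second and later wins. Here is a minimal sketch of the difference, using a made-up Series of shortened names (the values are illustrative only).

```python
import pandas as pd

# Toy list of names; repeated entries stand in for repeat winners.
names = pd.Series(["Curie", "Pauling", "Curie", "Bardeen", "Pauling", "Sanger"])

# Default keep="first": only the later occurrences are flagged.
later_only = names.duplicated()
print(later_only.tolist())   # [False, False, True, False, True, False]

# keep=False: every occurrence of a repeated value is flagged.
all_flagged = names.duplicated(keep=False)
print(all_flagged.tolist())  # [True, True, True, False, True, False]
```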
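Challenge: In how many categories are prizes awarded? Create a plotly bar chart with the number of prizes awarded by category. Use the colour scale called Aggrnyl to colour the chart, but don't show a color axis.
df_data.category.nunique()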
6
prizes_per_category = df_data.category.value_counts()
prizes_per_category
Medicine      222
Physics       216
Chemistry     186
Peace         135
Literature    117
Economics      86
Name: category, dtype: int64
v_bar = px.bar(x=prizes_per_category.index,
y=prizes_per_category.values,
color=prizes_per_category.values,
color_continuous_scale="Aggrnyl",
title="Number of Prizes Awarded per Category")
v_bar.update_layout(xaxis_title="Nobel Prize Category",
yaxis_title="Number of Awards",
coloraxis_showscale=False)
v_bar.show()
Challenge: When was the first prize in the field of Economics awarded? Who did the prize go to?
df_data[df_data.category == "Economics"].year.min()
1969
df_data.query('category == "Economics" and year == 1969')
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Jan Tinbergen | 1903-04-12 | the Hague | Netherlands | Netherlands | Male | The Netherlands School of Economics | Rotterdam | Netherlands | NLD | 0.50 |
| 394 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Ragnar Frisch | 1895-03-03 | Oslo | Norway | Norway | Male | University of Oslo | Oslo | Norway | NOR | 0.50 |
Challenge: Create a plotly bar chart that shows the split between men and women by category.
cat_men_women = df_data.groupby(["category", "sex"], as_index=False).agg({"prize": pd.Series.count})
cat_men_women.sort_values("prize", inplace=True, ascending=False)
cat_men_women.head()
| | category | sex | prize |
|---|---|---|---|
| 11 | Physics | Male | 212 |
| 7 | Medicine | Male | 210 |
| 1 | Chemistry | Male | 179 |
| 5 | Literature | Male | 101 |
| 9 | Peace | Male | 90 |
v_bar_split = px.bar(x=cat_men_women.category,
y=cat_men_women.prize,
color=cat_men_women.sex,
title="Number of Prizes Awarded per Category split by Gender")
v_bar_split.update_layout(xaxis_title="Nobel Prize Category",
yaxis_title="Number of Prizes")
v_bar_split.show()
Challenge: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.

Show the yearly totals as a scatter plot coloured in dodgerblue, while the rolling average is coloured in crimson.
prizes_per_year = df_data.groupby("year").count().prize
prizes_per_year
year
1901 6
1902 7
1903 7
1904 6
1905 5
..
2016 11
2017 12
2018 13
2019 14
2020 12
Name: prize, Length: 117, dtype: int64
moving_average = prizes_per_year.rolling(window=5).mean()
moving_average
year
1901 NaN
1902 NaN
1903 NaN
1904 NaN
1905 6.20
...
2016 11.60
2017 12.00
2018 12.00
2019 12.20
2020 12.40
Name: prize, Length: 117, dtype: float64
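To see where the first non-NaN value comes from, it helps to redo the calculation for the first five years by hand. This sketch copies the 1901 to 1905 counts from the output above into a small Series. With window=5, the first four positions have too few observations and stay NaN.

```python
import pandas as pd

# Prize counts for 1901-1905, copied from the prizes_per_year output above.
counts = pd.Series([6, 7, 7, 6, 5], index=[1901, 1902, 1903, 1904, 1905])

# A 5-year window needs five observations, so only 1905 gets a value:
# (6 + 7 + 7 + 6 + 5) / 5 = 6.2
rolling = counts.rolling(window=5).mean()
print(rolling)
```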
plt.figure(figsize=(8, 4), dpi=200)
plt.scatter(x=prizes_per_year.index,
y=prizes_per_year.values,
color="dodgerblue",
alpha=0.6,
s=100)
plt.plot(moving_average.index,
moving_average.values,
color="crimson",
linewidth=3,)
plt.show()
plt.figure(figsize=(8, 4), dpi=200)
plt.title("Number of Nobel Prizes per Year", fontsize=10)
plt.yticks(fontsize=8)
plt.xticks(ticks=np.arange(1900, 2021, step=5), fontsize=8, rotation=45)
plt.grid(color="gray", linestyle="--")
ax1 = plt.gca()  # a single y-axis is enough for this chart
ax1.set_xlim(1900, 2020)
ax1.scatter(x=prizes_per_year.index,
y=prizes_per_year.values,
color="dodgerblue",
alpha=0.6,
s=60)
ax1.plot(moving_average.index,
moving_average.values,
color="crimson",
linewidth=2,)
ax1.set_xlabel("Year")
ax1.set_ylabel("Number of Nobel Prizes")
plt.show()
Challenge: Investigate if more prizes are shared than before.
yearly_avg_share = df_data.groupby(by='year').agg({'share_pct': pd.Series.mean})
yearly_avg_share.head()
| year | share_pct |
|---|---|
| 1901 | 0.83 |
| 1902 | 0.71 |
| 1903 | 0.71 |
| 1904 | 0.83 |
| 1905 | 1.00 |
share_moving_average = yearly_avg_share.rolling(window=5).mean()
share_moving_average.head()
| year | share_pct |
|---|---|
| 1901 | NaN |
| 1902 | NaN |
| 1903 | NaN |
| 1904 | NaN |
| 1905 | 0.82 |
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx() # create second y-axis
ax1.set_xlim(1900, 2020)
ax1.scatter(x=prizes_per_year.index,
y=prizes_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax1.plot(prizes_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
# Adding prize share plot on second axis
ax2.plot(prizes_per_year.index,
share_moving_average.values,
c='grey',
linewidth=3,)
ax1.set_xlabel("Year")
ax1.set_ylabel("Number of Nobel Prizes")
ax2.set_ylabel("Moving Average of Prize Share")
plt.show()
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.set_xlim(1900, 2020)
# Can invert axis
ax2.invert_yaxis()
ax1.scatter(x=prizes_per_year.index,
y=prizes_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax1.plot(prizes_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
ax2.plot(prizes_per_year.index,
share_moving_average.values,
c='grey',
linewidth=3,)
plt.show()
Challenge:
Create a pandas DataFrame called top20_countries that has two columns. The prize column should contain the total number of prizes won.
Is it best to use birth_country, birth_country_current or organization_country? What are some potential problems when using birth_country or any of the others? Which column is the least problematic? Then use plotly to create a horizontal bar chart showing the number of prizes won by each country.
What is the ranking for the top 20 countries in terms of the number of prizes?
top20_countries = df_data.groupby("birth_country_current", as_index=False).agg(
{"prize":pd.Series.count}).sort_values("prize")[-20:]
top20_countries.head()
| | birth_country_current | prize |
|---|---|---|
| 7 | Belgium | 9 |
| 31 | Hungary | 9 |
| 33 | India | 9 |
| 2 | Australia | 10 |
| 20 | Denmark | 12 |
h_bar = px.bar(x=top20_countries.prize,
y=top20_countries.birth_country_current,
orientation='h',
color=top20_countries.prize,
color_continuous_scale='Viridis',
title='Top 20 Countries by Number of Prizes')
h_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Country',
coloraxis_showscale=False)
h_bar.show()
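The groupby, sort, and slice steps above can also be written with value_counts() and nlargest(). This is a sketch of that alternative on a made-up column; the letters stand in for country names and the variable names are assumptions. The final sort_values() keeps the ascending order that the horizontal bar chart relies on to draw the largest bar at the top.

```python
import pandas as pd

# Toy stand-in for the birth_country_current column.
toy = pd.DataFrame({"birth_country_current": ["A", "B", "A", "C", "A", "B"]})

# Count rows per country, keep the two largest, then sort ascending for plotting.
counts = toy["birth_country_current"].value_counts()
top2 = counts.nlargest(2).sort_values()
print(top2)
```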
Create a choropleth map, using the plotly documentation, that shows the number of prizes won by each country.
Experiment with plotly's available colours. I quite like the sequential colour matter on this map.
Hint: You'll need to use a 3-letter country code for each country.
df_countries = df_data.groupby(["birth_country_current", "ISO"], as_index=False).agg({"prize":pd.Series.count})
df_countries
| | birth_country_current | ISO | prize |
|---|---|---|---|
| 0 | Algeria | DZA | 2 |
| 1 | Argentina | ARG | 4 |
| 2 | Australia | AUS | 10 |
| 3 | Austria | AUT | 18 |
| 4 | Azerbaijan | AZE | 1 |
| ... | ... | ... | ... |
| 74 | United States of America | USA | 281 |
| 75 | Venezuela | VEN | 1 |
| 76 | Vietnam | VNM | 1 |
| 77 | Yemen | YEM | 1 |
| 78 | Zimbabwe | ZWE | 1 |
79 rows × 3 columns
world_map = px.choropleth(df_countries,
locations='ISO',
color='prize',
hover_name='birth_country_current',
color_continuous_scale=px.colors.sequential.matter)
world_map.update_layout(coloraxis_showscale=True,)
world_map.show()
Challenge: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes.
The hard part is preparing the data for this chart!
Hint: Take a two-step approach. The first step is grouping the data by country and category. The second step is merging the result with top20_countries, so that each row carries both the per-category count and the country's total.

cat_country = df_data.groupby(['birth_country_current', 'category'],
as_index=False).agg({'prize': pd.Series.count})
cat_country.sort_values(by='prize', ascending=False, inplace=True)
cat_country[cat_country.birth_country_current == "India"]
| | birth_country_current | category | prize |
|---|---|---|---|
| 89 | India | Economics | 2 |
| 90 | India | Literature | 2 |
| 91 | India | Medicine | 2 |
| 88 | India | Chemistry | 1 |
| 92 | India | Peace | 1 |
| 93 | India | Physics | 1 |
merged_df = pd.merge(cat_country, top20_countries, on='birth_country_current')
# change column names
merged_df.columns = ['birth_country_current', 'category', 'cat_prize', 'total_prize']
merged_df.sort_values(by='total_prize', inplace=True)
merged_df.head()
| | birth_country_current | category | cat_prize | total_prize |
|---|---|---|---|---|
| 109 | India | Physics | 1 | 9 |
| 108 | India | Peace | 1 | 9 |
| 88 | Belgium | Peace | 3 | 9 |
| 89 | Belgium | Medicine | 3 | 9 |
| 90 | Belgium | Chemistry | 1 | 9 |
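Two things happen in the merge that are easy to miss: the default inner join drops every country that is not in the top 20, and the two prize columns collide, which is why the columns get renamed afterwards. The sketch below shows both effects on made-up frames. The frame names, the `country` column, and the suffixes are assumptions chosen for illustration.

```python
import pandas as pd

# Toy per-category counts for three countries.
per_category = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "category": ["Physics", "Peace", "Physics", "Peace"],
    "prize": [2, 1, 4, 1],
})

# Toy "top countries" table that deliberately leaves out country C.
totals = pd.DataFrame({"country": ["A", "B"], "prize": [3, 4]})

# Inner join on country; suffixes name the two colliding prize columns.
merged = pd.merge(per_category, totals, on="country", suffixes=("_cat", "_total"))
print(merged)
```

Passing suffixes up front is an alternative to overwriting merged_df.columns after the merge.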
cat_cntry_bar = px.bar(x=merged_df.cat_prize,
y=merged_df.birth_country_current,
color=merged_df.category,
orientation='h',
title='Top 20 Countries by Number of Prizes and Category')
cat_cntry_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Country')
cat_cntry_bar.show()
Challenge: Show how the number of prizes won by each country has changed over time. Use the birth_country_current of the winner to calculate this.
prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
prize_by_year
| | year | birth_country_current | prize |
|---|---|---|---|
| 118 | 1901 | France | 2 |
| 346 | 1901 | Poland | 1 |
| 159 | 1901 | Germany | 1 |
| 312 | 1901 | Netherlands | 1 |
| 440 | 1901 | Switzerland | 1 |
| ... | ... | ... | ... |
| 31 | 2019 | Austria | 1 |
| 221 | 2020 | Germany | 1 |
| 622 | 2020 | United States of America | 7 |
| 533 | 2020 | United Kingdom | 2 |
| 158 | 2020 | France | 1 |
627 rows × 3 columns
cumulative_prizes = prize_by_year.groupby(by=['birth_country_current',
'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
| | birth_country_current | year | prize |
|---|---|---|---|
| 0 | Algeria | 1957 | 1 |
| 1 | Algeria | 1997 | 2 |
| 2 | Argentina | 1936 | 1 |
| 3 | Argentina | 1947 | 2 |
| 4 | Argentina | 1980 | 3 |
| ... | ... | ... | ... |
| 622 | United States of America | 2020 | 281 |
| 623 | Venezuela | 1980 | 1 |
| 624 | Vietnam | 1973 | 1 |
| 625 | Yemen | 2011 | 1 |
| 626 | Zimbabwe | 1960 | 1 |
627 rows × 3 columns
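The key idea behind the running totals is a cumulative sum that restarts for every country. Here is a minimal sketch of that idea on made-up numbers, written in a slightly different way from the cell above: it sorts first and then calls cumsum() on a single grouped column. The column names `country` and `running_total` are assumptions for illustration.

```python
import pandas as pd

# Toy yearly prize counts for two countries.
yearly = pd.DataFrame({
    "country": ["A", "B", "A", "B", "A"],
    "year": [1901, 1901, 1905, 1910, 1920],
    "prize": [1, 2, 1, 1, 3],
})

# Sort so each country's rows are in chronological order,
# then accumulate the prize counts within each country.
yearly = yearly.sort_values(["country", "year"])
yearly["running_total"] = yearly.groupby("country")["prize"].cumsum()
print(yearly)
```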
l_chart = px.line(cumulative_prizes,
x='year',
y='prize',
color='birth_country_current',
hover_name='birth_country_current')
l_chart.update_layout(xaxis_title='Year',
yaxis_title='Number of Prizes')
l_chart.show()
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 |
Challenge: Create a bar chart showing the organisations affiliated with the Nobel laureates.
df_organisations = df_data.groupby("organization_name", as_index=False).prize.count()
df_organisations
| | organization_name | prize |
|---|---|---|
| 0 | A.F. Ioffe Physico-Technical Institute | 1 |
| 1 | Aarhus University | 1 |
| 2 | Academy of Sciences | 3 |
| 3 | Amsterdam University | 2 |
| 4 | Argonne National Laboratory | 1 |
| ... | ... | ... |
| 259 | École Nationale Supérieur des Mines de Paris | 1 |
| 260 | École Normale Supérieure | 1 |
| 261 | École Polytechnique | 2 |
| 262 | École Supérieure de Physique et Chimie | 1 |
| 263 | École municipale de physique et de chimie indu... | 1 |
264 rows × 2 columns
top20_organisations = df_organisations.sort_values("prize")[-20:]
top20_organisations.tail()
| | organization_name | prize |
|---|---|---|
| 198 | University of Chicago | 20 |
| 117 | Massachusetts Institute of Technology (MIT) | 21 |
| 167 | Stanford University | 23 |
| 68 | Harvard University | 29 |
| 196 | University of California | 40 |
org_h_bar = px.bar(x=top20_organisations.prize,
y=top20_organisations.organization_name,
color=top20_organisations.prize,
color_continuous_scale=px.colors.sequential.haline,
title="Top 20 Research Institutions by Number of Prizes",
orientation='h')
org_h_bar.update_layout(xaxis_title="Number of Prizes",
yaxis_title="Organisation",
coloraxis_showscale=False)
org_h_bar.show()
Where do major discoveries take place?
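Challenge: Create a plotly bar chart of the top 20 cities in which the laureates' research organisations are located.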
df_city = df_data.groupby("organization_city", as_index=False).prize.count()
top20_city = df_city.sort_values("prize")[-20:]
top20_city.tail()
| | organization_city | prize |
|---|---|---|
| 128 | Paris | 25 |
| 92 | London | 27 |
| 33 | Cambridge | 31 |
| 121 | New York, NY | 45 |
| 34 | Cambridge, MA | 50 |
city_h_bar = px.bar(x=top20_city.prize,
y=top20_city.organization_city,
color=top20_city.prize,
color_continuous_scale=px.colors.sequential.solar,
title="Top 20 Research Cities by Number of Prizes",
orientation='h')
city_h_bar.update_layout(xaxis_title="Number of Prizes",
yaxis_title="City",
coloraxis_showscale=False)
city_h_bar.show()
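Challenge: Create a plotly bar chart of the top 20 birth cities of Nobel laureates. Use the colour scale called Plasma for the chart.
df_birth = df_data.groupby("birth_city", as_index=False).prize.count()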
top20_birth = df_birth.sort_values("prize")[-20:]
top20_birth.tail()
| | birth_city | prize |
|---|---|---|
| 112 | Chicago, IL | 12 |
| 572 | Vienna | 14 |
| 313 | London | 19 |
| 418 | Paris | 26 |
| 382 | New York, NY | 53 |
birth_h_bar = px.bar(x=top20_birth.prize,
y=top20_birth.birth_city,
color=top20_birth.prize,
color_continuous_scale=px.colors.sequential.Plasma,
title="Top 20 Birth Cities by Number of Prizes",
orientation='h')
birth_h_bar.update_layout(xaxis_title="Number of Prizes",
yaxis_title="City",
coloraxis_showscale=False)
birth_h_bar.show()
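Challenge: Create a plotly sunburst chart that combines the country, city, and organisation of each laureate's affiliation.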
country_city_org = df_data.groupby(by=['organization_country',
'organization_city',
'organization_name'], as_index=False).agg({'prize': pd.Series.count})
country_city_org = country_city_org.sort_values('prize', ascending=False)
country_city_org.head()
| | organization_country | organization_city | organization_name | prize |
|---|---|---|---|---|
| 205 | United States of America | Cambridge, MA | Harvard University | 29 |
| 280 | United States of America | Stanford, CA | Stanford University | 23 |
| 206 | United States of America | Cambridge, MA | Massachusetts Institute of Technology (MIT) | 21 |
| 209 | United States of America | Chicago, IL | University of Chicago | 20 |
| 195 | United States of America | Berkeley, CA | University of California | 19 |
burst = px.sunburst(country_city_org,
path=['organization_country', 'organization_city', 'organization_name'],
values='prize',
title='Where do Discoveries Take Place?',
)
# A sunburst chart has no x- or y-axis, so no axis titles are set here.
burst.update_layout(coloraxis_showscale=False)
burst.show()
birth_years = df_data.birth_date.dt.year
df_data['winning_age'] = df_data.year - birth_years
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 | 49.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 | 62.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 | 47.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 | 79.00 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 | 73.00 |
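Two details of winning_age are worth spelling out. It is the difference between two calendar years, so it ignores the month and day and can be off by one from the laureate's exact age. It also shows up as a float (49.00) because rows without a birth date produce NaN. The sketch below reproduces both points on two values: the first birth date is van 't Hoff's from the table above, and the missing second one is a stand-in for an organisation.

```python
import pandas as pd

# Award years for two rows; the second row has no birth date.
award_years = pd.Series([1901, 1904])
birth_dates = pd.to_datetime(pd.Series(["1852-08-30", None]))

# Subtract calendar years only; a missing date propagates as NaN.
ages = award_years - birth_dates.dt.year
print(ages)
```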
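Challenge: Who were the oldest and the youngest winners? Visualise the distribution of laureate ages as a histogram and experiment with the number of bins to see how the visualisation changes.
display(df_data.nlargest(n=1, columns='winning_age'))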
display(df_data.nsmallest(n=1, columns='winning_age'))
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 2019 | Chemistry | The Nobel Prize in Chemistry 2019 | “for the development of lithium-ion batteries” | 1/3 | Individual | John Goodenough | 1922-07-25 | Jena | Germany | Germany | Male | University of Texas | Austin TX | United States of America | DEU | 0.33 | 97.00 |
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 885 | 2014 | Peace | The Nobel Peace Prize 2014 | "for their struggle against the suppression of... | 1/2 | Individual | Malala Yousafzai | 1997-07-12 | Mingora | Pakistan | Pakistan | Female | NaN | NaN | NaN | PAK | 0.50 | 17.00 |
df_data.winning_age.describe()
count    934.00
mean      59.95
std       12.62
min       17.00
25%       51.00
50%       60.00
75%       69.00
max       97.00
Name: winning_age, dtype: float64
Experiment with the number of bins. Try 10, 20, 30, and 50.
plt.figure(figsize=(8, 4), dpi=200)
sns.histplot(data=df_data,
x=df_data.winning_age,
bins=30)
plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()
Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?
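Challenge: Use Seaborn's .regplot() to show the trend in laureate age over time. Set the lowess parameter to True to fit a locally weighted regression, which behaves like a moving average of the linear fit.
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.regplot(data=df_data,
                x='year',
                y='winning_age',
                lowess=True,  # locally weighted fit; requires statsmodels
                scatter_kws={'alpha': 0.4},
                line_kws={'color': 'black'})
plt.show()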
How does the age of laureates vary by category?
Challenge: Use Seaborn's .boxplot() to show how the median, quartiles, maximum, and minimum values vary across categories. Which category has the longest "whiskers"?
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.boxplot(data=df_data,
                x='category',
                y='winning_age')
plt.show()
Challenge: Use Seaborn's .lmplot() and the row parameter to create 6 separate charts, one for each prize category. Again set lowess to True. Is the .lmplot() telling a different story from the .boxplot()? Then use .lmplot() to put all 6 categories on the same chart using the hue parameter.
with sns.axes_style('whitegrid'):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               row='category',
               lowess=True,  # locally weighted fit; requires statsmodels
               aspect=2,
               scatter_kws={'alpha': 0.6},
               line_kws={'color': 'black'})
plt.show()
with sns.axes_style("whitegrid"):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               hue='category',
               aspect=2,
               scatter_kws={'alpha': 0.5},
               line_kws={'linewidth': 5})
plt.show()